Modeling Using IBM Debater® Thematic Clustering of Sentences

Using the IBM Debater® Thematic Clustering of Sentences dataset, you will create a K-Means clustering model using the library sklearn to dynamically group sentences by their theme. As different sentences are input into the model, the number of groups and the themes of the groups change accordingly. Using the data you cleaned in Part 1 - Data Exploration & Visualization, you will test this clustering model. Finally, an example of a possible real world use case of this model is presented.

The dataset contains 692 articles from Wikipedia, where the number of sections (clusters) in each article ranges from 5 to 12, and the number of sentences per article ranges from 17 to 1614.

Table of Contents

0. Prerequisites
1. Load Data
2. Modeling
3. Testing
4. Example

0. Prerequisites

Before you run this notebook, complete the following steps:

Insert a project token

When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:

# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context

If you do not see the cell above, insert the project token manually so that the notebook can access the dataset from the project's resources: click the More icon (the three vertical dots) in the notebook toolbar, select Insert project token, and run the inserted cell.

Import required modules
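The original import cell is not reproduced here; below is a minimal sketch of the modules the rest of this notebook relies on, inferred from the libraries named in sections 2 and 3.

# Minimal import sketch, inferred from the sklearn-based pipeline in sections 2 and 3.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score, v_measure_score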

1. Load Data

In the first notebook (Part 1 - Data Exploration & Visualization), you modified the original dataset and saved two files as data assets to your Watson Studio project. Let's load these files.

Note: if you haven't yet run the first notebook, run it first; otherwise the cells below will not work.

Now you can get the two datasets.

Convert groups_of_themes to a list of groups (list_of_groups) for ease of use in section 2.
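A minimal sketch of these two steps, assuming the hypothetical asset names below and assuming groups_of_themes stores one theme per column (substitute your own file names and structure from Part 1):

# NOTE: the file names below are placeholders -- use the asset names you saved in Part 1.
import pandas as pd

sentences_df = pd.read_csv(project.get_file('sentences_processed.csv'))
groups_of_themes = pd.read_csv(project.get_file('groups_of_themes.csv'))

# Assuming each column of groups_of_themes holds the sentences of one theme,
# flatten the DataFrame into a list of groups for use in section 2.
list_of_groups = [groups_of_themes[col].dropna().tolist()
                  for col in groups_of_themes.columns]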

Now you are ready to create a model.

2. Modeling

In this section, you will create the clustering model. The model created will be evaluated using the processed data just loaded in section 1.

To understand the model, it helps to know these definitions:

  • TF-IDF (term frequency–inverse document frequency): a weighting scheme that scores how important a term is to a document relative to the whole collection; terms that are frequent in one document but rare overall score highest.
  • K-Means: a clustering algorithm that partitions points into k clusters by repeatedly assigning each point to the nearest cluster centroid and recomputing the centroids.
  • Silhouette score: a measure of how similar a point is to its own cluster compared to the other clusters, ranging from -1 to 1; a higher average score indicates better-separated clusters.

Now to create the clustering model, you will use the following steps:

  1. Create a TF-IDF matrix from the input text. You will notice that TfidfVectorizer takes these parameters:
    • max_df=0.75 means that terms appearing in more than 75% of documents (here, each input text is a document) are ignored
    • min_df=0.1 means that terms appearing in less than 10% of documents are ignored
    • stop_words='english' means sklearn's built-in English stop words are removed from the input text
    • ngram_range=(1,3) means that unigrams, bigrams, and trigrams are used
  2. Since K-Means finds clusters based on distances, the TF-IDF matrix is used as input to the K-Means model.
    • The number of clusters is an input to the model.
    • If no number of clusters is given, the best number is determined using silhouette scores.
  3. Additionally, a function is written to extract the top terms used to cluster the text.

The following function, get_top_n_terms_per_cluster(), does step 3 and will be called in run_model(). Its purpose is to get the top terms that were used to cluster the text so that you can inspect the groups and spot patterns.
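The notebook's code cell is not reproduced here; a minimal sketch of what this helper could look like follows (the signature and the n_terms default are assumptions):

def get_top_n_terms_per_cluster(kmeans_model, feature_names, n_terms=5):
    # For each cluster, sort the centroid's TF-IDF weights in descending
    # order and keep the n_terms highest-weighted terms.
    cluster_terms = {}
    ordered = kmeans_model.cluster_centers_.argsort()[:, ::-1]
    for label, term_indices in enumerate(ordered):
        cluster_terms[label] = [feature_names[i] for i in term_indices[:n_terms]]
    return cluster_terms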

The next function, run_kmeans(), does step 2 and is also called in run_model(). It uses the Python library sklearn to create a K-Means model (a type of clustering model). K-Means essentially uses the distance between points (each comment or text) to find the best clusters. The distances are calculated from a TF-IDF matrix, which is created in run_model().
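Again, a hedged sketch of run_kmeans(), including the silhouette-based search used when no cluster count is given (the max_clusters bound and random_state are assumptions):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def run_kmeans(tfidf_matrix, number_of_clusters=None, max_clusters=10):
    # If the caller fixed the number of clusters, fit a single model.
    if number_of_clusters is not None:
        return KMeans(n_clusters=number_of_clusters, random_state=42).fit(tfidf_matrix)
    # Otherwise sweep candidate cluster counts (the silhouette score needs
    # 2 <= k <= n_samples - 1) and keep the best-scoring model.
    best_model, best_score = None, -1.0
    for k in range(2, min(max_clusters, tfidf_matrix.shape[0] - 1) + 1):
        candidate = KMeans(n_clusters=k, random_state=42)
        labels = candidate.fit_predict(tfidf_matrix)
        score = silhouette_score(tfidf_matrix, labels)
        if score > best_score:
            best_model, best_score = candidate, score
    return best_model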

Finally, run_model() puts together the entire process described earlier (a sketch of it follows the return values listed below) by:

  1. Creating the TF-IDF matrix from the input text (shown in the first few lines of the method),
  2. Running the K-Means model (by calling run_kmeans()), and
  3. Getting the top terms used to create the clusters (by calling get_top_n_terms_per_cluster()).

This method returns:

  1. best_clusters which is a list of the cluster labels for each text
    • e.g. [0, 1, 1, 2, 0] means that there were 5 input texts and the model created 3 clusters
  2. cluster_terms which is a dictionary mapping the cluster label to the top terms used to create that cluster.
    • e.g. {0: ['money', 'price'], 1: ['customer', 'service'], 2: ['online', 'web']}
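Putting the pieces together, a sketch of run_model() consistent with the parameters listed above and these return values:

from sklearn.feature_extraction.text import TfidfVectorizer

def run_model(texts, number_of_clusters=None, n_terms=5):
    # Step 1: build the TF-IDF matrix with the parameters described above.
    vectorizer = TfidfVectorizer(max_df=0.75, min_df=0.1,
                                 stop_words='english', ngram_range=(1, 3))
    tfidf_matrix = vectorizer.fit_transform(texts)
    # Step 2: run K-Means (choosing the cluster count if none was given).
    model = run_kmeans(tfidf_matrix, number_of_clusters)
    best_clusters = model.labels_.tolist()
    # Step 3: extract the top terms that characterize each cluster.
    # (On sklearn < 1.0, use vectorizer.get_feature_names() instead.)
    cluster_terms = get_top_n_terms_per_cluster(
        model, vectorizer.get_feature_names_out(), n_terms)
    return best_clusters, cluster_terms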

3. Testing

Next, test the clustering model with the dataset loaded in section 1 and preprocessed in the first notebook.

Evaluate the model using sklearn's implementation of V-measure. To help understand V-measure, here are some definitions:

  • Homogeneity: each cluster contains only members of a single class.
  • Completeness: all members of a given class are assigned to the same cluster.
  • V-measure: the harmonic mean of homogeneity and completeness; it ranges from 0 to 1, where 1 means the clustering perfectly matches the true labels.

Additionally, a baseline model is created that predicts random clusters for each input. You want the clustering model defined in section 2 to do better than this baseline.
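The baseline cell is not shown here; one minimal way to implement it (the function name and seed are assumptions):

import numpy as np

def run_baseline(texts, n_clusters, seed=42):
    # Assign every text to a uniformly random cluster; any useful
    # clustering model should beat this floor.
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_clusters, size=len(texts)).tolist()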

Let's run the model from section 2 on the test data loaded in section 1. Then use the baseline model of randomly chosen clusters on the same data. Once done, print out the results.

As mentioned previously, the V-measure can be used to evaluate a clustering model by comparing the true cluster labels with the predicted cluster labels. The higher the V-measure, the better the clustering model, i.e. the closer the model is to perfectly labeling the data.
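For instance, V-measure is invariant to how the cluster labels are numbered, so a labeling that matches the true grouping exactly scores 1.0 even when the label ids differ:

from sklearn.metrics import v_measure_score

true_labels = [0, 0, 1, 1, 2]
predicted   = [1, 1, 0, 0, 2]  # same grouping, permuted label ids
print(v_measure_score(true_labels, predicted))  # 1.0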

Now let's test how well the model clusters the sentences using the data loaded at the beginning of this notebook.

You can see that although the model does not assign the labels perfectly (the Test V-measure is < 1), it does a much better job than randomly picking clusters: the Test V-measure is higher than the Baseline V-measure; in fact, the model's score is about three times the baseline's.

4. Example

Finally, let's use a sample of comments a retail company could see and run the model on it. To test out your own comments:

  1. Define your comments, e.g. comments = ['comment1', 'comment2', ...],
  2. Use run_model(comments), which returns best_labels and top_terms, e.g. best_labels, top_terms = run_model(comments), and
  3. Follow the example below to use print_clustering_result() to print out the groups in a more human-readable way.

The following method helps to print out clustering results in a more interpretable way.
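The method's code cell is omitted here; a minimal sketch of what print_clustering_result() could look like, followed by a hypothetical usage example with made-up retail feedback:

def print_clustering_result(comments, best_labels, top_terms):
    # Print each cluster's top terms, followed by the comments it contains.
    for label in sorted(top_terms):
        print(f"Group {label} (top terms: {', '.join(top_terms[label])})")
        for comment, assigned in zip(comments, best_labels):
            if assigned == label:
                print(f"  - {comment}")
        print()

# Hypothetical usage with made-up comments:
comments = ['The socks were way too expensive.',
            'Prices on jackets keep going up.',
            'Expensive shirts, cheap quality.',
            'Customer service was friendly and quick.',
            'The service desk resolved my issue fast.']
best_labels, top_terms = run_model(comments)
print_clustering_result(comments, best_labels, top_terms)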

Using the example comments (comments_5 and comments_20), you can see how the model decided to cluster the sentences. These input comments represent potential feedback a retail company could receive.

Example 1

In the first example above (comments_5), the comments are split into 2 groups. In one group, the three comments seem to be related to prices and clothing, thus the top terms are 'expensive' and 'socks'. For the other group, the theme seems to be about service, which is shown as a top term for that group.

Example 2

In this second example (comments_20), the parameter number_of_clusters was used to tell the model to output 3 clusters. In one group the theme of the sentences focuses on prices, in another group the theme is centered around customer service, and in the final group the theme is about the clothes.
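For reference, forcing a fixed number of clusters only requires passing that parameter, assuming comments_20 holds the 20 sample comments defined earlier in the notebook:

# comments_20 is the 20-comment sample defined earlier in the notebook.
best_labels, top_terms = run_model(comments_20, number_of_clusters=3)
print_clustering_result(comments_20, best_labels, top_terms)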

Authors

This notebook was created by the Center for Open-Source Data & AI Technologies.


Copyright © 2021 IBM. This notebook and its source code are released under the terms of the MIT License.